Abstract: An "entropy increasing to the maximum" result analogous to the entropic central limit theorem (Barron 1986; Artstein et al. 2004) is obtained in the discrete setting. This involves the thinning operation and a Poisson limit. Monotonic convergence in relative entropy is established for general discrete distributions, while monotonic increase of Shannon entropy is proved for the special class of ultra-log-concave distributions. Overall we extend the parallel between the information-theoretic central limit theorem and law of small numbers explored by Kontoyiannis et al. (2005) and Harremoës et al. (2007, 2008). Ingredients in the proofs include convexity, majorization, and stochastic orders.

Index Terms: binomial thinning; convex order; logarithmic Sobolev inequality; majorization; Poisson approximation; relative entropy; Schur-concavity; ultra-log-concavity.



I. Introduction 

The information-theoretic central limit theorem (CLT, [4]) states that, for a sequence of independent and identically distributed (i.i.d.) random variables $X_i$, $i = 1, 2, \ldots$, with zero mean and unit variance, the normalized partial sum $Z_n = \sum_{i=1}^n X_i/\sqrt{n}$ tends to $N(0,1)$ as $n \to \infty$ in relative entropy, as long as the relative entropy $D(Z_n|N(0,1))$ is eventually finite. An interesting feature is that $D(Z_n|N(0,1))$ decreases monotonically in $n$, or, equivalently, the differential entropy of $Z_n$ increases to that of the standard normal. While this monotonicity is an old problem ([24]), its full solution was obtained only recently by Artstein et al. [2]; see Tulino and Verdú [34], Madiman and Barron [26], and Shlyakhtenko [31], [32] for ramifications. In this paper we establish analogous results for a general version of the law of small numbers, extending the parallel between the information-theoretic CLT and the information-theoretic law of small numbers explored in [14], [23], [15] and [16]. Such monotonicity results are interesting as they reveal fundamental connections between probability, information theory, and physics (the analogy with the second law of thermodynamics). Moreover, the associated inequalities are often of great practical significance. The entropic CLT, for example, is closely related to Shannon's entropy power inequality ([5], [33]), which is a valuable tool in analyzing Gaussian channels.

Informally, the law of small numbers refers to the phenomenon that, for random variables $X_i$ on $\mathbb{Z}_+ = \{0, 1, \ldots\}$, the sum $\sum_{i=1}^n X_i$ has approximately a Poisson distribution with mean $\lambda = \sum_{i=1}^n EX_i$, as long as i) each $X_i$ is such that $\Pr(X_i = 0)$ is close to one, $\Pr(X_i = 1)$ is uniformly small, and $\Pr(X_i > 1)$ is negligible compared to $\Pr(X_i = 1)$; and ii) the dependence between the $X_i$'s is sufficiently weak. In the version considered by Harremoës et al. [15], [16] and in this paper, the $X_i$'s are i.i.d. random variables obtained from a common distribution through thinning. (Indeed, Harremoës et al. term their result "the law of thin numbers".) The notion of thinning was introduced by Rényi [29].

Definition 1: The $\alpha$-thinning ($\alpha \in (0,1)$) of a probability mass function (pmf) $f$ on $\mathbb{Z}_+$, denoted as $T_\alpha(f)$, is the pmf of $\sum_{i=1}^{Y} X_i$, where $Y$ has pmf $f$ and, independent of $Y$, $X_i$, $i = 1, 2, \ldots$, are i.i.d. Bernoulli($\alpha$) random variables, i.e., $\Pr(X_i = 1) = 1 - \Pr(X_i = 0) = \alpha$.

Thinning is closely associated with certain classical distributions such as the Poisson and the binomial. For the Poisson pmf $po(\lambda) = \{po(i;\lambda),\ i = 0, 1, \ldots\}$, with $po(i;\lambda) = \lambda^i e^{-\lambda}/i!$, we have

$$T_\alpha(po(\lambda)) = po(\alpha\lambda).$$

For the binomial pmf $bi(n,p) = \{bi(i;n,p),\ i = 0, \ldots, n\}$, with $bi(i;n,p) = \binom{n}{i} p^i (1-p)^{n-i}$, we have

$$T_\alpha(bi(n,p)) = bi(n, \alpha p).$$

Basic properties of thinning also include the semigroup relation ([19])

$$T_\alpha(T_\beta(f)) = T_{\alpha\beta}(f). \tag{1}$$

Thinning for discrete random variables is analogous to scaling 
for their continuous counterparts. 
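The thinning operation and the two identities above can be checked numerically. The following is a minimal sketch, assuming a finitely supported pmf stored as a Python list indexed from 0; the helper names `binom_pmf`, `thin` and `max_gap` are illustrative and do not come from the paper.

```python
from math import comb

def binom_pmf(n, p):
    """bi(n, p) as a list indexed by i = 0..n."""
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def thin(f, a):
    """alpha-thinning of a finitely supported pmf f:
    (T_a f)_i = sum_j f_j * bi(i; j, a)."""
    out = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i, b in enumerate(binom_pmf(j, a)):
            out[i] += fj * b
    return out

def max_gap(f, g):
    """Largest pointwise difference between two pmfs of equal length."""
    return max(abs(x - y) for x, y in zip(f, g))

f = binom_pmf(5, 0.6)
# Binomial closure: T_a(bi(n, p)) = bi(n, a p)
closure_gap = max_gap(thin(f, 0.3), binom_pmf(5, 0.3 * 0.6))
# Semigroup relation (1): T_a(T_b(f)) = T_{ab}(f)
semigroup_gap = max_gap(thin(thin(f, 0.5), 0.4), thin(f, 0.4 * 0.5))
print(closure_gap, semigroup_gap)
```

Both gaps should be at the level of floating-point rounding.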

The $n$-th convolution of $f$, denoted as $f^{*n}$, is the pmf of $\sum_{i=1}^n Y_i$ where the $Y_i$'s are i.i.d. with pmf $f$. It is easy to show that thinning and convolution operations commute, i.e.,

$$T_\alpha(f^{*n}) = (T_\alpha(f))^{*n}. \tag{2}$$
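Relation (2) can also be verified numerically on a small example. This is only a sketch under the assumption that pmfs are finite lists indexed from 0; `thin`, `convolve` and `conv_power` are hypothetical helper names.

```python
from math import comb

def thin(f, a):
    """(T_a f)_i = sum_j f_j * C(j, i) a^i (1 - a)^(j - i)."""
    out = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            out[i] += fj * comb(j, i) * a**i * (1 - a)**(j - i)
    return out

def convolve(f, g):
    """pmf of the sum of two independent variables with pmfs f and g."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

def conv_power(f, n):
    """n-fold convolution f^{*n}."""
    out = [1.0]
    for _ in range(n):
        out = convolve(out, f)
    return out

f = [0.2, 0.5, 0.0, 0.3]   # an arbitrary pmf on {0, 1, 2, 3}
a, n = 0.35, 4
lhs = thin(conv_power(f, n), a)      # T_a(f^{*n})
rhs = conv_power(thin(f, a), n)      # (T_a f)^{*n}
commute_gap = max(abs(x - y) for x, y in zip(lhs, rhs))
print(commute_gap)
```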



Using the notions of thinning and convolution, we can state the following version of the law of small numbers considered by Harremoës et al. [15]. As usual, for two pmfs $f$ and $g$, the entropy of $f$ is defined as $H(f) = -\sum_i f_i \log(f_i)$, and the relative entropy between $f$ and $g$ is defined as $D(f|g) = \sum_i f_i \log(f_i/g_i)$. It is understood that $D(f|g) = \infty$ if the support of $f$, $supp(f) = \{i : f_i > 0\}$, is not a subset of $supp(g)$. We frequently consider the relative entropy between a pmf $f$ and $po(\lambda)$, where $\lambda$ is the mean of $f$; we denote

$$D(f) = D(f|po(\lambda))$$

for convenience.
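As an illustration of these definitions, the sketch below evaluates $H(f)$ and $D(f)$ for $f = bi(2, 1/2)$, for which direct calculation gives $H(f) = \tfrac32\log 2$ and $D(f) = 1 - \tfrac54\log 2$. Natural logarithms and list-based pmfs are assumptions of the sketch, and the function names are illustrative.

```python
from math import exp, factorial, log

def entropy(f):
    """Shannon entropy H(f) in nats."""
    return -sum(p * log(p) for p in f if p > 0)

def rel_entropy(f, g):
    """D(f|g); infinite when supp(f) is not contained in supp(g)."""
    total = 0.0
    for i, p in enumerate(f):
        if p > 0:
            q = g[i] if i < len(g) else 0.0
            if q == 0:
                return float("inf")
            total += p * log(p / q)
    return total

def poisson_pmf(lam, size):
    """po(i; lam) for i = 0..size-1."""
    return [lam**i * exp(-lam) / factorial(i) for i in range(size)]

def D(f):
    """D(f) = D(f | po(lam)), with lam the mean of f."""
    lam = sum(i * p for i, p in enumerate(f))
    return rel_entropy(f, poisson_pmf(lam, len(f)))

f = [0.25, 0.5, 0.25]          # bi(2, 1/2), mean 1
print(entropy(f), D(f))
```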

Theorem 1: Let $f$ be a pmf on $\mathbb{Z}_+$ with mean $\lambda < \infty$. Then, as $n \to \infty$,

1) $T_{1/n}(f^{*n})$ tends to $po(\lambda)$ pointwise;

2) $H(T_{1/n}(f^{*n})) \to H(po(\lambda))$;

3) if $D(T_{1/n}(f^{*n}))$ ever becomes finite, then it tends to zero.

Part 1) of Theorem 1 is proved by Harremoës et al. [15], who also present a proof of Part 3) assuming $D(f) < \infty$. The current, slightly more general form of Part 3) is reminiscent of Barron's work [4] on the CLT. In Section II we present a short proof of Part 3). We also note that Part 2), which is stated in [16] with a stronger assumption, can be deduced from 1) directly.

A major goal of this work is to establish monotonicity properties in Theorem 1. We show that, in Part 3) of Theorem 1, the relative entropy never increases (Theorem 2), and, assuming $f$ is ultra-log-concave (see Definition 2), in Part 2) of Theorem 1 the entropy never decreases (Theorem 3). Both Theorems 2 and 3 can be regarded as discrete analogues of the monotonicity of entropy in the CLT ([2]), with thinning playing the role of scaling. (Unlike the CLT case, here monotonicity of the entropy and that of the relative entropy are not equivalent.) We begin with monotonicity of the relative entropy.

Theorem 2: If $f$ is a pmf on $\mathbb{Z}_+$ with a finite mean, then $D(T_{1/n}(f^{*n}))$ decreases on $n = 1, 2, \ldots$.

The proof of Theorem 2 uses two lemmas, which are of interest by themselves. These deal with the behavior of relative entropy under thinning (Lemma 1) and convolution (Lemma 2) respectively. Lemma 1 is proved in Section III, where we also note its close connection with modified logarithmic Sobolev inequalities (Bobkov and Ledoux [6]; Wu [35]) for the Poisson distribution.

Lemma 1 (The Thinning Lemma): Let $f$ be a pmf on $\mathbb{Z}_+$ with a finite mean. Then

$$D(T_\alpha(f)) \le \alpha D(f), \quad 0 < \alpha < 1.$$

An equivalent statement is that $\alpha^{-1} D(T_\alpha(f))$ increases in $\alpha \in (0, 1]$, in view of the semigroup property (1). Combined with a data processing argument, Lemma 1 can be used to show that the relative entropy is monotone along power-of-two iterates in Theorem 2. To prove Theorem 2 fully, however, we need the following convolution result, which may be seen as a "strengthened data processing inequality." Lemma 2 is proved in Section IV.

Lemma 2 (The Convolution Lemma): If $f$ is a pmf on $\mathbb{Z}_+$ with a finite mean, then $(1/n)D(f^{*n})$ decreases in $n$.

The main difference in the development here, compared with the CLT case, is that we need to consider the effect of both thinning and convolution. In the CLT case, the monotonicity of entropy can be obtained from one general convolution inequality for the Fisher information ([2], [26]). Nevertheless, the proofs of Lemmas 1 and 2 (Lemma 2 in particular) somewhat parallel the CLT case. We first express the desired divergence quantity as an integral via a de Bruijn type identity ([33], [5], [4]), and then analyze the monotonicity property of the integrand; see Sections III and IV for details.

Once we have Lemmas 1 and 2, Theorem 2 is quickly established.

Proof of Theorem 2: Lemma 1 and (1) imply ($n \ge 2$)

$$\frac{n}{n-1} D(T_{1/n}(f^{*n})) \le D(T_{1/(n-1)}(f^{*n})).$$

Lemma 2 and (2) then yield

$$D(T_{1/(n-1)}(f^{*n})) \le \frac{n}{n-1} D(T_{1/(n-1)}(f^{*(n-1)}))$$

and the claim follows. ■
By a different analysis, we also establish the monotonicity of $H(T_{1/n}(f^{*n}))$, under the assumption that $f$ is ultra-log-concave.

Definition 2: A nonnegative sequence $u = \{u_i,\ i \in \mathbb{Z}_+\}$ is called log-concave, if the support of $u$ is an interval of consecutive integers, and $u_i^2 \ge u_{i-1}u_{i+1}$ for all $i > 0$. A pmf $f$ is ultra-log-concave, or ULC, if the sequence $i! f_i$, $i \in \mathbb{Z}_+$, is log-concave.

Equivalently, $f$ is ULC if $i f_i/f_{i-1}$ decreases in $i$. It is clear that ultra-log-concavity implies log-concavity. Examples of ULC pmfs include the Poisson and the binomial. More generally, the pmf of $\sum_{i=1}^n X_i$ is ULC if the $X_i$'s are independent (not necessarily identically distributed) Bernoulli random variables.
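Definition 2 translates directly into a finite check. The sketch below assumes a finitely supported sequence stored as a list and uses a small relative tolerance to absorb rounding; the names `is_log_concave` and `is_ulc` are illustrative. It confirms that a binomial pmf and a sum of non-identical Bernoulli variables are ULC, while a (truncated) geometric sequence is log-concave but not ULC.

```python
from math import comb, factorial

def is_log_concave(u, tol=1e-9):
    """Support is an interval of consecutive integers and
    u_i^2 >= u_{i-1} u_{i+1} at interior points (up to rounding)."""
    support = [i for i, x in enumerate(u) if x > 0]
    if not support or support != list(range(support[0], support[-1] + 1)):
        return False
    return all(u[i] ** 2 * (1 + tol) >= u[i - 1] * u[i + 1]
               for i in range(1, len(u) - 1))

def is_ulc(f):
    """f is ULC if the sequence i! f_i is log-concave."""
    return is_log_concave([factorial(i) * p for i, p in enumerate(f)])

def convolve(f, g):
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

binomial = [comb(4, i) * 0.3**i * 0.7**(4 - i) for i in range(5)]
bernoulli_sum = convolve(convolve([0.9, 0.1], [0.5, 0.5]), [0.3, 0.7])
geometric = [0.5 * 0.5**i for i in range(8)]   # truncated, unnormalized tail
print(is_ulc(binomial), is_ulc(bernoulli_sum), is_ulc(geometric))
```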

The monotonicity of entropy is stated as follows. 

Theorem 3: If $f$ is ULC, then $H(T_{1/n}(f^{*n}))$ increases monotonically on $n = 1, 2, \ldots$.

An example ([14], [36]) is when $f$ is a Bernoulli with parameter $p$, in which case $T_{1/n}(f^{*n}) = bi(n, p/n)$. In other words, both the entropy and the relative entropy are monotone in the classical binomial-to-Poisson convergence.

It should not be surprising that we make the ULC assumption; the situation is similar to that of a Markov chain with homogeneous transition probabilities ([10], Chapter 4): relative entropy always decreases, but entropy does not increase without additional assumptions. The ULC assumption is natural in Theorem 3 because ULC distributions with the same mean $\lambda$ form a natural class in which the $Po(\lambda)$ distribution has maximum entropy [19]. In fact, if we reverse the ULC assumption (but still assume that $f$ is log-concave), then $H(T_{1/n}(f^{*n}))$ decreases monotonically (Theorem 7). Theorems 3 and 7 are proved in Section VI. The starting point in these proofs is a general result (Lemma 4) that relates entropy comparison to comparing the expectations of convex functions. This entails a rather detailed analysis of the convex order (to be defined in Section V) between the relevant distributions.

As a simple example, Fig. 1 displays the values of

$$d(n) = D(T_{1/n}(f^{*n})), \quad t(n) = nD(T_{1/n}(f)), \quad r(n) = n^{-1}D(f^{*n}), \quad h(n) = H(T_{1/n}(f^{*n}))$$

for $f = bi(2, 1/2)$ and $n = 1, \ldots, 10$. The monotone patterns of $d(n)$, $t(n)$, $r(n)$ and $h(n)$ illustrate Theorem 2, Lemma 1, Lemma 2 and Theorem 3 respectively.
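The four sequences can be recomputed with a few lines of code. The sketch below is not the author's code; it assumes natural logarithms and uses the binomial closure of thinning and convolution, so that for $f = bi(2, 1/2)$ one has $T_{1/n}(f^{*n}) = bi(2n, 1/(2n))$, $T_{1/n}(f) = bi(2, 1/(2n))$ and $f^{*n} = bi(2n, 1/2)$.

```python
from math import comb, exp, factorial, log

def binom_pmf(n, p):
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def entropy(f):
    return -sum(p * log(p) for p in f if p > 0)

def D(f):
    """Relative entropy between f and the Poisson pmf with the same mean."""
    lam = sum(i * p for i, p in enumerate(f))
    return sum(p * log(p * factorial(i) / (lam**i * exp(-lam)))
               for i, p in enumerate(f) if p > 0)

ns = range(1, 11)
d = [D(binom_pmf(2 * n, 1 / (2 * n))) for n in ns]        # Theorem 2
t = [n * D(binom_pmf(2, 1 / (2 * n))) for n in ns]        # Lemma 1
r = [D(binom_pmf(2 * n, 0.5)) / n for n in ns]            # Lemma 2
h = [entropy(binom_pmf(2 * n, 1 / (2 * n))) for n in ns]  # Theorem 3

def nonincreasing(x):
    return all(a >= b for a, b in zip(x, x[1:]))

def nondecreasing(x):
    return all(a <= b for a, b in zip(x, x[1:]))

print(nonincreasing(d), nonincreasing(t), nonincreasing(r), nondecreasing(h))
```

According to the results stated above, the first three sequences should be nonincreasing and the last nondecreasing.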

Besides monotonicity, an equally interesting problem is the rate of convergence. In Section VII we show that, if $f$ is ULC or has finite support, then $D(T_{1/n}(f^{*n})) = O(n^{-2})$, $n \to \infty$. This complements certain bounds obtained by Harremoës et al. [15], [16]. Different tools contribute to this $O(n^{-2})$ rate. For ULC distributions we use stochastic orders as in Section VI; for distributions with finite support, we simply analyze the scaled Fisher information ([23], [27]). We conclude with a discussion on possible extensions and refinements (of Theorem 2 in particular) in Section VIII.

[Figure 1: four panels plotting d(n) = D(bi(2n, 1/(2n))), t(n) = nD(bi(2, 1/(2n))), r(n) = (1/n)D(bi(2n, 1/2)) and h(n) = H(bi(2n, 1/(2n))) against n.]

Fig. 1. Values of d(n), t(n), r(n) and h(n) for n = 1, . . . , 10.

II. The convergence theorem 

This section deals with Theorem 1. Part 1) of Theorem 1 is proved in [15]. Part 2) is stated in [16] with the assumption that $f$ is ultra-log-concave. The present form only assumes that $\lambda$, the mean of $f$, is finite. Part 2) can be quickly proved as follows. Part 1) and Fatou's lemma yield

$$\liminf_{n \to \infty} H(T_{1/n}(f^{*n})) \ge H(po(\lambda)).$$

Let $g$ denote the pmf of a geometric($p$) distribution, i.e., $g_i = p(1-p)^i$, $i = 0, 1, \ldots$, $0 < p < 1$. By the lower-semicontinuity property of relative entropy,

$$\liminf_{n \to \infty} D(T_{1/n}(f^{*n})|g) \ge D(po(\lambda)|g). \tag{3}$$

For any pmf $h$ on $\mathbb{Z}_+$ with mean $\lambda$ we have $D(h|g) = -H(h) - \log p - \lambda\log(1-p)$. Since the mean of $T_{1/n}(f^{*n})$ is $\lambda$ for all $n$, (3) simplifies to

$$\limsup_{n \to \infty} H(T_{1/n}(f^{*n})) \le H(po(\lambda))$$

and Part 2) is proved.

Our proof of Part 3) uses convexity arguments that also yield some interesting intermediate results (Propositions 2 and 3). In Propositions 1-3 let $X_1, X_2, \ldots$ be i.i.d. with pmf $f$.

Proposition 1: For any $\alpha \in (0, 1]$, $D(T_\alpha(f)) < \infty$ if and only if $EX_1\log(X_1) < \infty$ (as usual $0\log 0 = 0$).

Proof: Let us consider $\alpha = 1$ first. Note that $H(f)$ is finite since the mean of $f$ is finite. We have

$$D(f) = \sum_{i \ge 0} f_i\log(i!) - \lambda\log(\lambda) + \lambda - H(f). \tag{4}$$

Thus $D(f) < \infty$ if and only if $\sum_{i \ge 0} f_i\log(i!)$ converges, which, by Stirling's formula, is equivalent to $EX_1\log(X_1) < \infty$.

For general $\alpha \in (0, 1]$, let $Y|X_1 \sim Bi(X_1, \alpha)$. By the preceding argument $D(T_\alpha(f)) < \infty$ if and only if $EY\log(Y) < \infty$. However

$$E\alpha X_1\log(\alpha X_1) \le EY\log(Y) \le EX_1\log(X_1)$$

where the lower bound holds by Jensen's inequality. Thus $EY\log(Y) < \infty$ is also equivalent to $EX_1\log(X_1) < \infty$. ■

A consequence of Proposition 1 is that, in Part 3), $D(T_{1/n}(f^{*n})) < \infty$ if and only if $E\bar X_n\log(\bar X_n) < \infty$. Here and in Propositions 2 and 3 below, $\bar X_n = (1/n)\sum_{i=1}^n X_i$.

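Proposition 2: For $n \ge 1$,

$$D(T_{1/n}(f^{*n})) \le \frac{\lambda}{n} + E\bar X_n\log\frac{\bar X_n}{\lambda}.$$

Proof: We borrow an idea of [15] used in the proof of their Proposition 8. Letting $g = f^{*n}$, we have

$$D(T_{1/n}(g)) = D\Big(\sum_{k=0}^\infty g_k\, bi(k, 1/n)\Big) \le \sum_{k=0}^\infty g_k D(bi(k, 1/n)|po(\lambda))$$

by convexity. However,

$$D(bi(k,p)|po(\lambda)) = D(bi(k,p)) + D(po(kp)|po(\lambda)) \le kp^2 + kp\log\frac{kp}{\lambda} - kp + \lambda$$

where the simple bound $D(bi(k,p)) \le kp^2$ (see [14] for its proof) is used in the inequality. Thus

$$D(T_{1/n}(g)) \le \sum_{k=0}^\infty g_k\Big[\frac{k}{n^2} + \frac{k}{n}\log\frac{k}{n\lambda} - \frac{k}{n} + \lambda\Big] = \frac{\lambda}{n} + E\bar X_n\log\frac{\bar X_n}{\lambda}$$

as required. ■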

Proposition 3: Denote $l_n = E\bar X_n\log(\bar X_n/\lambda)$. Then, as $n \uparrow \infty$, $l_n$ decreases to zero if it is finite for some $n$.

Proof: By Jensen's inequality, $l_n \ge 0$. Noting $\bar X_n = E[\bar X_{n-1}|\bar X_n]$, we apply Jensen's inequality again to get

$$l_n \le EE[\bar X_{n-1}\log(\bar X_{n-1}/\lambda)|\bar X_n] = l_{n-1}.$$

(Essentially we are proving $\bar X_n \le_{cx} \bar X_{n-1}$ where $\le_{cx}$ denotes the convex order; see [30]. Section V contains a brief introduction to several stochastic orders.) Thus $l_n \downarrow l_\infty$, say, with $l_\infty \ge 0$.

We show $l_\infty = 0$, assuming $l_k < \infty$ for some $k$. By symmetry $l_n = E\bar X_k\log(\bar X_n/\lambda)$, $n \ge k$. We may use this and Jensen's inequality to obtain

$$l_n \le E\bar X_k\log\frac{E[\bar X_n|\bar X_k]}{\lambda} = E\bar X_k\log\frac{k\bar X_k + (n-k)\lambda}{n\lambda}. \tag{5}$$

However,

$$\bar X_k\log\frac{k\bar X_k + (n-k)\lambda}{n\lambda} \le \bar X_k\max\Big\{0, \log\frac{\bar X_k}{\lambda}\Big\}$$

and the right hand side has a finite expectation since $l_k < \infty$. Letting $n \to \infty$ in (5) and using Fatou's lemma we obtain

$$l_\infty \le E\bar X_k\log\frac{\lambda}{\lambda} = 0,$$

which forces $l_\infty = 0$. ■

Part 3) is then a direct consequence of Propositions 1-3.

III. Lemma 1 and a Modified Logarithmic Sobolev Inequality

For any pmfs $g$ and $\tilde g$ on $\mathbb{Z}_+$, we have

$$D(T_\alpha(g)|T_\alpha(\tilde g)) \le D(g|\tilde g). \tag{6}$$

This is a special case of a general result on the decrease of relative entropy along a Markov chain (see [10], Chapter 4). It follows from (6) and the semigroup property (1) that, in the setting of Lemma 1, $D(T_\alpha(f))$ increases in $\alpha$. This is however not strong enough to prove Lemma 1 yet.

Let us recall the size-biasing operation, which often appears 
in Poisson approximation problems. 

Definition 3: For a pmf $f$ on $\mathbb{Z}_+$ with mean $\lambda > 0$, the size-biased pmf, denoted by $S(f)$, is defined on $\mathbb{Z}_+$ as

$$S(f) = \{(i+1)f_{i+1}/\lambda,\ i = 0, 1, \ldots\}.$$

The formulas $S(po(\lambda)) = po(\lambda)$ and $S(bi(n,p)) = bi(n-1,p)$ are readily verified. Moreover, size-biasing and thinning operations commute, i.e.,

$$T_\alpha(S(f)) = S(T_\alpha(f)). \tag{7}$$
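Size-biasing is easy to experiment with. The following sketch, with illustrative helper names and pmfs stored as finite lists, checks $S(bi(n,p)) = bi(n-1,p)$ and the commuting relation (7) on small examples.

```python
from math import comb

def binom_pmf(n, p):
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def thin(f, a):
    out = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            out[i] += fj * comb(j, i) * a**i * (1 - a)**(j - i)
    return out

def size_bias(f):
    """S(f)_i = (i + 1) f_{i+1} / lam; requires a positive mean lam."""
    lam = sum(i * p for i, p in enumerate(f))
    return [(i + 1) * f[i + 1] / lam for i in range(len(f) - 1)]

def max_gap(f, g):
    return max(abs(x - y) for x, y in zip(f, g))

# S(bi(n, p)) = bi(n - 1, p)
binomial_gap = max_gap(size_bias(binom_pmf(6, 0.4)), binom_pmf(5, 0.4))

# Relation (7): T_a(S(f)) = S(T_a(f)) for an arbitrary pmf on {0, 1, 2, 3}
f = [0.1, 0.2, 0.3, 0.4]
commute_gap = max_gap(thin(size_bias(f), 0.7), size_bias(thin(f, 0.7)))
print(binomial_gap, commute_gap)
```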



Key to the proof of Lemma 1 is the following identity; see Johnson [19] for related calculations.

Lemma 3: Let $f = \{f_i,\ i \ge 0\}$ be a pmf on $\mathbb{Z}_+$ with mean $\lambda \in (0, \infty)$, and assume that the support of $f$ is finite, i.e., there exists some $k$ such that $f_i = 0$ for all $i > k$. Then

$$\frac{dD(T_\alpha(f))}{d\alpha} = \lambda D(T_\alpha(S(f))|T_\alpha(f)), \quad \alpha \in (0, 1). \tag{8}$$
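Identity (8) can be sanity-checked by comparing a central finite difference of $\alpha \mapsto D(T_\alpha(f))$ with the right hand side. This is a numerical sketch only, for one arbitrarily chosen pmf and one value of $\alpha$; the step size `eps` and all helper names are assumptions of the sketch.

```python
from math import comb, exp, factorial, log

def thin(f, a):
    out = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            out[i] += fj * comb(j, i) * a**i * (1 - a)**(j - i)
    return out

def mean(f):
    return sum(i * p for i, p in enumerate(f))

def size_bias(f):
    lam = mean(f)
    return [(i + 1) * f[i + 1] / lam for i in range(len(f) - 1)]

def rel_entropy(f, g):
    """D(f|g), assuming supp(f) is contained in supp(g)."""
    return sum(p * log(p / g[i]) for i, p in enumerate(f) if p > 0)

def D(f):
    lam = mean(f)
    po = [lam**i * exp(-lam) / factorial(i) for i in range(len(f))]
    return rel_entropy(f, po)

f = [0.1, 0.4, 0.2, 0.3]
a, eps = 0.6, 1e-5
# Left side of (8): numerical derivative of D(T_a(f)) in a
lhs = (D(thin(f, a + eps)) - D(thin(f, a - eps))) / (2 * eps)
# Right side of (8): lam * D(T_a(S(f)) | T_a(f))
rhs = mean(f) * rel_entropy(thin(size_bias(f), a), thin(f, a))
print(lhs, rhs)
```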



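Proof: Write $g = T_\alpha(f)$ for convenience, i.e.,

$$g_i = \sum_{j \ge 0} f_j\, bi(i; j, \alpha).$$

By direct calculation

$$\begin{aligned}
\frac{dD(g)}{d\alpha} &= \sum_{i \ge 0} \frac{dg_i}{d\alpha}\log\frac{g_i}{po(i;\alpha\lambda)} \\
&= \sum_{i \ge 0,\, j \ge 1} j f_j\,[bi(i-1; j-1, \alpha) - bi(i; j-1, \alpha)]\log\frac{g_i}{po(i;\alpha\lambda)} \\
&= \sum_{i \ge 1,\, j \ge 1} j f_j\, bi(i-1; j-1, \alpha)\Big[\log\frac{g_i}{po(i;\alpha\lambda)} - \log\frac{g_{i-1}}{po(i-1;\alpha\lambda)}\Big] \\
&= \lambda\sum_{i \ge 0}\sum_{j \ge 0} (S(f))_j\, bi(i; j, \alpha)\log\frac{(i+1)g_{i+1}}{\alpha\lambda g_i} \\
&= \lambda D(T_\alpha(S(f))|T_\alpha(f))
\end{aligned}$$

where the simple identity

$$\frac{d\,bi(i; n, p)}{dp} = n[bi(i-1; n-1, p) - bi(i; n-1, p)]$$

is used in the second step, and Abel's summation formula in the third. (By convention $bi(i; n, p) = 0$ if $i < 0$ or $i > n$.) All sums are finite sums since $f$ has finite support. ■

Remark. The assumption that $f$ has finite support does not appear to impose a serious limit on the applicability of Lemma 3. Of course, it would be good to see this assumption relaxed.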

Proof of Lemma 1: Let us first assume that $f$ has finite support. Then $D(T_\alpha(f))$ is obviously continuous on $\alpha \in [0, 1]$. Lemma 3 and (6) show that $dD(T_\alpha(f))/d\alpha$ increases on $\alpha \in (0, 1)$. Thus $D(T_\alpha(f))$ is convex on $\alpha \in [0, 1]$, and, since it vanishes at $\alpha = 0$, the claim follows. For general $f$, we construct a sequence of pmfs $f^{(k)} = \{f_i^{(k)},\ i \ge 0\}$, $k = 1, 2, \ldots$, by truncation. In other words, let $f_i^{(k)} = c_k f_i$, $i = 0, \ldots, k$, where $c_k = (\sum_{i \le k} f_i)^{-1}$, and $f_i^{(k)} = 0$, $i > k$. Assume $D(f) < \infty$ without loss of generality. Then $T_\alpha(f^{(k)})$ tends to $T_\alpha(f)$ pointwise as $k \to \infty$. It is also easy to show

$$D(f^{(k)}) \to D(f), \quad k \to \infty.$$

Thus, by the finite-support result and the lower-semicontinuity property of the relative entropy, we have

$$D(T_\alpha(f)) \le \liminf_{k \to \infty} D(T_\alpha(f^{(k)})) \le \liminf_{k \to \infty} \alpha D(f^{(k)}) = \alpha D(f)$$

as required. ■
For two pmfs $f$ and $g$ on $\mathbb{Z}_+$ with finite means, the data-processing inequality ([10]) gives ($*$ denotes convolution)

$$D(T_\alpha(f) * T_\beta(g)) \le D(T_\alpha(f)) + D(T_\beta(g)) \tag{9}$$

where $\alpha, \beta \in [0, 1]$. By Lemma 1, we have

$$D(T_\alpha(f) * T_\beta(g)) \le \alpha D(f) + \beta D(g). \tag{10}$$

This is enough to prove Theorem 2 in the special case of power-of-two iterates, i.e., $D(T_{1/n}(f^{*n}))$ decreases on $n = 2^k$, $k = 0, 1, \ldots$. To establish Theorem 2 fully, we need a convolution inequality stronger than (9), namely Lemma 2. Section IV contains the details.

A result closely related to Lemma 1 is Theorem 4, which was proved by Wu ([35], Eqn. 0.6) using advanced stochastic calculus tools (see [6], [8], [9] for related work). Our proof of Theorem 4, based on convexity, is similar in spirit to those given by [8], [9]; the use of thinning appears new.

Theorem 4 ([35]): For a pmf $f$ on $\mathbb{Z}_+$ with mean $\lambda \in (0, \infty)$ we have

$$D(f) \le \lambda D(S(f)|f). \tag{11}$$

Proof: Let us assume the support of $f$ is finite. The convexity of $h(\alpha) = D(T_\alpha(f))$ implies $h'(\alpha) \ge h(\alpha)/\alpha$ for all $\alpha \in (0, 1)$. If $D(S(f)|f) < \infty$ then $supp(f)$ is an interval of consecutive integers including zero. We may let $\alpha \to 1$ and obtain

$$\lambda D(S(f)|f) = \lim_{\alpha \uparrow 1} h'(\alpha) \ge h(1) = D(f).$$

When the support of $f$ is not finite, an argument similar to the one for Lemma 1 applies. ■

Theorem 4 sharpens a modified logarithmic Sobolev inequality originally obtained by Bobkov and Ledoux [6].

Corollary 1 ([6], Corollary 4): In the setting of Theorem 4, assume that $f_i > 0$ for all $i \in \mathbb{Z}_+$. Then

$$D(f) \le \lambda\chi^2(S(f), f) \tag{12}$$

where $\chi^2(S(f), f) = \sum_{i \ge 0} f_i\,((S(f))_i/f_i - 1)^2$.

The inequality (12) follows from Theorem 4 and the well-known inequality between the relative entropy and the $\chi^2$ distance. For an application of (12) to Poisson approximation bounds, see [23].
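The chain $D(f) \le \lambda D(S(f)|f) \le \lambda\chi^2(S(f), f)$ behind (11) and (12) can be illustrated numerically. The sketch below uses a small finitely supported pmf chosen for illustration, so it does not satisfy the full-support assumption of Corollary 1; the $\chi^2$ sum is simply restricted to the support of $f$, and all names are hypothetical.

```python
from math import exp, factorial, log

def mean(f):
    return sum(i * p for i, p in enumerate(f))

def size_bias(f):
    """S(f), padded with a trailing zero to the length of f."""
    lam = mean(f)
    return [(i + 1) * f[i + 1] / lam for i in range(len(f) - 1)] + [0.0]

def rel_entropy(f, g):
    """D(f|g), assuming supp(f) is contained in supp(g)."""
    return sum(p * log(p / g[i]) for i, p in enumerate(f) if p > 0)

def D(f):
    lam = mean(f)
    po = [lam**i * exp(-lam) / factorial(i) for i in range(len(f))]
    return rel_entropy(f, po)

def chi2(p, q):
    """chi^2(p, q) = sum_i q_i (p_i / q_i - 1)^2 over the support of q."""
    return sum(qi * (pi / qi - 1) ** 2 for pi, qi in zip(p, q) if qi > 0)

f = [0.1, 0.4, 0.2, 0.3]
lam = mean(f)
s = size_bias(f)
lhs = D(f)                       # left side of (11)
mid = lam * rel_entropy(s, f)    # right side of (11)
rhs = lam * chi2(s, f)           # right side of (12)
print(lhs, mid, rhs)
```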

IV. Relative entropy under convolution 

This section establishes Lemma 2. The starting point is an easily verified decomposition formula (Proposition 4). Proposition 4 was used by Madiman et al. [27] to derive a convolution inequality ([27], Theorem III) for the scaled Fisher information, which is $\lambda\chi^2(S(f), f)$ as in (12). Here we obtain a monotonicity result (Corollary 2) for the relative entropy $D(S(f^{*n})|f^{*n})$, which is instrumental in the proof of Lemma 2.

Proposition 4 ([27], Eqn. 14): Let $q^{(i)}$ be pmfs on $\mathbb{Z}_+$ with finite means $\lambda_i$, $i = 1, \ldots, n$, respectively ($n \ge 2$). Define $q = q^{(1)} * \cdots * q^{(n)}$ and $q^{(-i)} = q^{(1)} * \cdots * q^{(i-1)} * q^{(i+1)} * \cdots * q^{(n)}$ (i.e., $q^{(i)}$ is left out), $i = 1, \ldots, n$. Then there holds

$$S(q) = \sum_{i=1}^n \beta_i\, q^{(i)} * S(q^{(-i)})$$

where $\beta_i = (1 - \lambda_i/\sum_{j=1}^n \lambda_j)/(n-1)$. (In statistical terms, we have a mixture representation of $S(q)$.)

Proposition 5: In the setting of Proposition 4 we have

$$D(q|S(q)) \le \sum_{i=1}^n \beta_i D(q^{(-i)}|S(q^{(-i)})); \tag{13}$$

$$D(S(q)|q) \le \sum_{i=1}^n \beta_i D(S(q^{(-i)})|q^{(-i)}). \tag{14}$$

Proof: We prove (13); the same argument applies to (14). By convexity, Proposition 4 yields

$$D(q|S(q)) \le \sum_{i=1}^n \beta_i D(q|q^{(i)} * S(q^{(-i)})).$$

However, since $q = q^{(i)} * q^{(-i)}$ for each $i$, we have

$$D(q|q^{(i)} * S(q^{(-i)})) \le D(q^{(-i)}|S(q^{(-i)}))$$

by data processing, and the claim follows. ■

Corollary 2 corresponds to the case of identical $q^{(i)}$'s in Proposition 5.

Corollary 2: For any pmf $f$ on $\mathbb{Z}_+$ with mean $\lambda \in (0, \infty)$, both $D(S(f^{*n})|f^{*n})$ and $D(f^{*n}|S(f^{*n}))$ decrease in $n$.
Proof of Lemma 2: Let us assume that $f$ has finite support first. We have (8) in the integral form

$$\frac{1}{n}D(f^{*n}) = \lambda\int_0^1 D(T_\alpha(S(f^{*n}))|T_\alpha(f^{*n}))\,d\alpha \tag{15}$$

$$= \lambda\int_0^1 D(S((T_\alpha(f))^{*n})|(T_\alpha(f))^{*n})\,d\alpha \tag{16}$$

where (16) holds by the commuting relations (2) and (7). By Corollary 2 the integrand in (16) decreases in $n$ for each $\alpha$. Thus $(1/n)D(f^{*n})$ decreases in $n$ as claimed. For general $f$, we again use truncation. Specifically, let $f^{(k)}$ and $c_k$ be defined as in the proof of Lemma 1. For $n \ge 2$ let $g = f^{*n}$, and similarly let $g^{(k)}$ denote the $n$th convolution of $f^{(k)}$. Then $g^{(k)}$ tends to $g$ pointwise, and the mean of $g^{(k)}$ tends to that of $g$. Assume $D(g) < \infty$, which amounts to $\sum_i g_i\log(i!) < \infty$. The argument for Part 2) of Theorem 1 shows

$$H(g^{(k)}) \to H(g), \quad k \to \infty. \tag{17}$$

We also have the simple inequality $g_i^{(k)} \le c_k^n g_i$ for all $i$. Since $c_k \to 1$ as $k \to \infty$, we may apply dominated convergence to obtain

$$\sum_i g_i^{(k)}\log(i!) \to \sum_i g_i\log(i!), \quad k \to \infty,$$

which, taken together with (17), shows

$$D(g^{(k)}) \to D(g), \quad k \to \infty.$$

The finite-support result and the lower-semicontinuity property of relative entropy then yield

$$\frac{1}{n+1}D(f^{*(n+1)}) \le \frac{1}{n}D(f^{*n})$$

as in the proof of Lemma 1. ■

A generalization of Lemma 2 is readily obtained if we use Proposition 5 rather than Corollary 2 in the above argument.

Theorem 5: In the setting of Proposition 4,

$$D(q) \le \frac{1}{n-1}\sum_{i=1}^n D(q^{(-i)}).$$

Theorem 5 strengthens the usual data processing inequality

$$D(q) \le \sum_{i=1}^n D(q^{(i)})$$

in the same way that the entropy power inequality of Artstein et al. [2] strengthens Shannon's classical entropy power inequality.

Remark. A by-product of Corollary 2 is that the divergence quantities

$$h_n = D(T_{1/n}(f^{*n})|S(T_{1/n}(f^{*n}))) \quad \text{and} \quad \tilde h_n = D(S(T_{1/n}(f^{*n}))|T_{1/n}(f^{*n}))$$

also decrease in $n$. Indeed we have

$$\begin{aligned}
h_n &= D(T_{1/n}(f^{*n})|T_{1/n}(S(f^{*n}))) \\
&\le D(T_{1/(n-1)}(f^{*n})|T_{1/(n-1)}(S(f^{*n}))) \qquad (18) \\
&= D((T_{1/(n-1)}(f))^{*n}|S((T_{1/(n-1)}(f))^{*n})) \\
&\le h_{n-1} \qquad (19)
\end{aligned}$$

where (6) is used in (18), Corollary 2 is used in (19), and the commuting relations (2) and (7) are applied throughout. The proof for $\tilde h_n$ is the same. These monotonicity statements complement Theorem 2.

V. Stochastic orders and majorization

The proof of the monotonicity of entropy (Theorem 3) involves several notions of stochastic orders which we briefly introduce.

Definition 4: For two random variables $X$ and $Y$ with pmfs $f$ and $g$ respectively,

• $X$ is smaller than $Y$ in the usual stochastic order, written as $X \le_{st} Y$, if $\Pr(X > c) \le \Pr(Y > c)$ for all $c$;

• $X$ is smaller than $Y$ in the convex order, written as $X \le_{cx} Y$, if $E\phi(X) \le E\phi(Y)$ for every convex function $\phi$ such that the expectations exist;

• $X$ is log-concave relative to $Y$, written as $X \le_{lc} Y$, if i) both $supp(f)$ and $supp(g)$ are intervals of consecutive integers, ii) $supp(f) \subset supp(g)$, and iii) $\log(f_i/g_i)$ is concave on $supp(f)$.

We use $\le_{st}$, $\le_{cx}$, $\le_{lc}$ with the pmfs as well as the random variables. In general, $f \le_{st} g$ if there exist random variables $X$ and $Y$ with pmfs $f$ and $g$ respectively such that $X \le Y$ almost surely. Examples include

$$bi(n,p) \le_{st} bi(n+1,p), \qquad bi(n,p) \le_{st} bi(n,p'), \quad p < p'.$$

In contrast, $\le_{cx}$ compares variability. A classical example (Hoeffding [18]) is

$$bi(n, \lambda/n) \le_{cx} bi(n+1, \lambda/(n+1)), \quad 0 < \lambda < n.$$

Another example mentioned in Section II is $\bar X_n \le_{cx} \bar X_{n-1}$ where $\bar X_n = (1/n)\sum_{i=1}^n X_i$ for i.i.d. $X_i$'s with a finite mean. The log-concavity order $\le_{lc}$ is also useful in our context; for example, $f$ being ULC can be written as $f \le_{lc} po(\lambda)$, $\lambda > 0$. (The actual value of $\lambda$ is irrelevant.) Further properties of these stochastic orders can be found in Shaked and Shanthikumar [30].

We also need the concepts of majorization and Schur 
concavity. 

Definition 5: A real vector $b = (b_1, \ldots, b_n)$ is said to majorize $a = (a_1, \ldots, a_n)$, written as $a \prec b$, if

• $\sum_{i=1}^n a_i = \sum_{i=1}^n b_i$, and

• $\sum_{i=k}^n a_{(i)} \le \sum_{i=k}^n b_{(i)}$, $k = 2, \ldots, n$, where $a_{(1)} \le \cdots \le a_{(n)}$ and $b_{(1)} \le \cdots \le b_{(n)}$ are $(a_1, \ldots, a_n)$ and $(b_1, \ldots, b_n)$ arranged in increasing order, respectively.

A function $\phi(a)$ symmetric in the coordinates of $a = (a_1, \ldots, a_n)$ is said to be Schur concave, if

$$a \prec b \implies \phi(a) \ge \phi(b).$$

As is well-known, if pmfs $f$ and $g$ on $\{0, \ldots, n\}$ (viewed as vectors of the respective probabilities) satisfy $f \prec g$, then $H(f) \ge H(g)$. In other words $H(f)$ is a Schur concave function of $f$. Further properties and various applications of these two notions can be found in Hardy et al. [13] and Marshall and Olkin [28].
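Definition 5 and the Schur concavity of entropy can be illustrated with a short check. The sketch assumes vectors stored as Python lists and a small tolerance for rounding; `majorized_by` is an illustrative name. The example pair is the uniform vector $(1/4, \ldots, 1/4)$ and $(1/3, 1/3, 1/3, 0)$.

```python
from math import log

def majorized_by(a, b, tol=1e-12):
    """True if a is majorized by b (a ≺ b) in the sense of Definition 5."""
    if len(a) != len(b) or abs(sum(a) - sum(b)) > tol:
        return False
    sa, sb = sorted(a), sorted(b)
    # Compare sums of the largest entries: indices k..n in 1-based notation
    return all(sum(sa[k:]) <= sum(sb[k:]) + tol for k in range(1, len(a)))

def entropy(f):
    return -sum(p * log(p) for p in f if p > 0)

a = [0.25, 0.25, 0.25, 0.25]
b = [1 / 3, 1 / 3, 1 / 3, 0.0]
print(majorized_by(a, b), entropy(a), entropy(b))
```

Here $a \prec b$ holds and, as Schur concavity predicts, the entropy of $a$ is at least that of $b$.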

VI. Monotonicity of the entropy 

This section proves Theorem 3. We state a key lemma that can be traced back to Karlin and Rinott [22].

Lemma 4: Let $f$ and $g$ be pmfs on $\mathbb{Z}_+$ such that $f \le_{cx} g$ and $g$ is log-concave. Then

$$H(f) + D(f|g) \le H(g).$$

In particular $H(f) \le H(g)$ with equality only if $f = g$.
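For orientation, one way to see the inequality is sketched below; this is a reading of the definitions rather than a reproduction of a proof from the cited references. It assumes that $-\log g_i$, which is convex in $i$ on $supp(g)$ when $g$ is log-concave, can be used as the convex test function in the definition of $\le_{cx}$.

```latex
\begin{aligned}
H(f) + D(f|g)
  &= -\sum_i f_i \log f_i + \sum_i f_i \log\frac{f_i}{g_i}
   = \sum_i f_i\,(-\log g_i) \\
  &\le \sum_i g_i\,(-\log g_i) = H(g).
\end{aligned}
```

The equality case then follows because $H(f) = H(g)$ forces $D(f|g) = 0$.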

Although Lemma 4 follows almost immediately from the definitions (hence the proof is omitted), it is a useful tool in several entropy comparison contexts ([22], [36], [38], [39]). Effectively, Lemma 4 reduces entropy comparison to two (often easier) problems: i) establishing a log-concavity result, and ii) comparing the expectations of convex functions. A modification of Lemma 4 is used by [36] to give a short and unified proof of the main theorems of [19] and [37] concerning the maximum entropy properties of the Poisson and binomial distributions. We quote Johnson's result. Further extensions to compound distributions can be found in [21], [39].

Theorem 6: If a pmf $f$ on $\mathbb{Z}_+$ is ULC with mean $\lambda$, then $H(f) \le H(po(\lambda))$, with equality only if $f = po(\lambda)$.

To apply Lemma 4 to our problem, we show that, in the setting of Theorem 3,

$$T_{1/(n-1)}(f^{*(n-1)}) \le_{cx} T_{1/n}(f^{*n}). \tag{20}$$

In a sense, (20) means that $T_{1/n}(f^{*n})$ becomes more and more "spread out" as $n$ increases. On the other hand, it can be shown that $T_{1/n}(f^{*n})$ is log-concave for all $n$. Indeed, $f$ is ULC and hence log-concave. It is well-known that convolution preserves log-concavity. That thinning preserves log-concavity is sometimes known as Brenti's criterion [7] in the combinatorics literature. Thus $T_{1/n}(f^{*n})$ remains log-concave. Actually, since $f$ is ULC, there holds the stronger relation

$$T_{1/n}(f^{*n}) \le_{lc} po(\lambda). \tag{21}$$

Relation (21) follows from i) if $f$ is ULC then so is $f^{*n}$ (Liggett [25]) and ii) if $f$ is ULC then so is $T_\alpha(f)$ (Johnson [19], Proposition 3.7).

The core of the proof of Theorem 3 is proving (20). The notions of majorization and Schur concavity briefly reviewed in Section V are helpful in formulating a more general (and easier to handle) version of (20).

Proposition 6: Let $Y_1, \ldots, Y_n$ be i.i.d. random variables on $\mathbb{Z}_+$ with an ultra-log-concave pmf $f$. Conditional on the $Y_i$'s, let $Z_i$, $i = 1, \ldots, n$, be independent $Bi(Y_i, p_i)$ random variables respectively, where $p_1, \ldots, p_n \in [0, 1]$. Let $\phi$ be a convex function on $\mathbb{Z}_+$. Then $E\phi(\sum_{i=1}^n Z_i)$ is a Schur concave function of $(p_1, \ldots, p_n)$ on $[0, 1]^n$.

The proof of Proposition 6, somewhat technical, is collected in the appendix.

Proof of (20): Noting that

$$(1/n, \ldots, 1/n) \prec (1/(n-1), \ldots, 1/(n-1), 0)$$

the claim follows from Proposition 6 and the definition of Schur-concavity. ■

Theorem 3 then follows from (20), (21) and Lemma 4.

Remark. Theorem 3 resembles the semigroup argument of Johnson [19] in that both are statements of "entropy increasing to the maximum," and both involve convolution and thinning operations. The difference is that [19] considers convolution with a Poisson while we study the self-convolution $f^{*n}$.

As mentioned in Section I, if we reverse the ULC assumption (but still assume log-concavity), then the conclusion of Theorem 3 is also reversed.

Theorem 7: Let $f$ be a pmf on $\mathbb{Z}_+$ with mean $\lambda$. Assume $f$ is log-concave, and assume $po(\lambda) \le_{lc} f$. Then $H(T_{1/n}(f^{*n}))$ decreases in $n$.

Theorem 7 extends a minimum entropy result that parallels Theorem 6.

Proposition 7 ([36]): The $Po(\lambda)$ distribution achieves minimum entropy among all pmfs $f$ with mean $\lambda$ such that $f$ is log-concave and $po(\lambda) \le_{lc} f$.

An example of Theorem 7, also noted in [36], is when $f$ is a geometric($p$) pmf, in which case $T_{1/n}(f^{*n}) = nb(n, n/(n - 1 + 1/p))$. (Here $nb(n,p)$ denotes the negative binomial pmf with parameters $(n,p)$, i.e., $nb(n,p) = \{\binom{n+i-1}{i} p^n (1-p)^i,\ i = 0, 1, \ldots\}$.) In other words, the negative-binomial-to-Poisson convergence is monotone in entropy (as long as the first parameter of the negative binomial is at least 1).

The proof of Theorem 7 parallels that of Theorem 3. In place of (20) we have

$$T_{1/n}(f^{*n}) \le_{cx} T_{1/(n-1)}(f^{*(n-1)}) \tag{22}$$

assuming $po(\lambda) \le_{lc} f$. The proof of (20) applies after reversing the direction of $\le_{lc}$ in the relevant places. As noted before, since $f$ is log-concave, $T_{1/n}(f^{*n})$ is log-concave for all $n$. Thus Theorem 7 follows from Lemma 4, as does Theorem 3. Incidentally, we have

$$po(\lambda) \le_{lc} f \implies po(\lambda) \le_{lc} T_{1/n}(f^{*n}), \tag{23}$$

which is a reversal of (21). To prove (23), we note that, according to a result of Davenport and Pólya [11], $po(\lambda) \le_{lc} f$ implies $po(\lambda) \le_{lc} f^{*n}$. By a slight modification of the argument of Johnson ([19], Proposition 3.7), we can also show that $po(\lambda) \le_{lc} f$ implies $po(\lambda) \le_{lc} T_\alpha(f)$ (details omitted); thus (23) holds.

VII. Rate of convergence 

Assuming that $f$ is a pmf on $\mathbb{Z}_+$ with mean $\lambda$ and variance $\sigma^2 < \infty$, Harremoës et al. ([15], Corollary 9) show that

$$D(T_{1/n}(f^{*n})) \le \frac{\lambda}{n} + \frac{\sigma^2}{n\lambda}.$$

That is, the relative entropy converges at a rate of (at least) $O(n^{-1})$. We aim to improve this to $O(n^{-2})$ under some natural assumptions. The $O(n^{-2})$ rate is perhaps not surprising since, in the binomial case ([17]),

$$D(bi(n, \lambda/n)) = O(n^{-2}), \quad n \to \infty. \tag{24}$$
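The binomial rate (24) is easy to observe numerically: the rescaled quantity $n^2 D(bi(n, \lambda/n))$ should stay bounded as $n$ grows. The sketch below tabulates it for $\lambda = 1$ and a few values of $n$; the choice of $\lambda$ and of the grid is arbitrary, and no claim is made here about the limiting constant.

```python
from math import comb, exp, factorial, log

def binom_pmf(n, p):
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def D_poisson(f, lam):
    """D(f | po(lam)) for a finitely supported pmf f."""
    return sum(p * log(p * factorial(i) / (lam**i * exp(-lam)))
               for i, p in enumerate(f) if p > 0)

lam = 1.0
scaled = {n: n**2 * D_poisson(binom_pmf(n, lam / n), lam)
          for n in (10, 20, 40, 80)}
for n, value in scaled.items():
    print(n, value)   # n^2 * D(bi(n, lam/n)) for each n
```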

We first use the stochastic orders $\le_{cx}$ and $\le_{lc}$ to extend (24) to ULC distributions.

Theorem 8: If $f$ is ULC on $\mathbb{Z}_+$ with mean $\lambda$, then

$$D(T_{1/n}(f^{*n})) \le \{n\lambda\}\,D(bi(\lfloor n\lambda\rfloor + 1, 1/n)|po(\lambda)) + (1 - \{n\lambda\})\,D(bi(\lfloor n\lambda\rfloor, 1/n)|po(\lambda)) \tag{25}$$

where $\{x\}$ and $\lfloor x\rfloor$ denote the fractional and integer parts of $x$, respectively.

Theorem 8 and (24) easily yield

$$D(T_{1/n}(f^{*n})) = O(n^{-2}), \quad n \to \infty,$$

as long as $f$ is ULC. To prove Theorem 8 we again adopt the strategy of Section VI. Proposition 8 is a variant of Lemma 4.

Proposition 8: Let $f$ and $g$ be pmfs on $\mathbb{Z}_+$ such that $f \le_{cx} g$ and $g$ is ULC. Then

$$D(f) \ge D(g) + D(f|g).$$

We also have the following result, which is easily deduced from Theorem 3.A.13 of Shaked and Shanthikumar [30] (see also [39], Lemma 2). Plainly, it says that the convex order $\le_{cx}$ is preserved under thinning.

Proposition 9: If $f$ and $g$ are pmfs on $\mathbb{Z}_+$ such that $f \le_{cx} g$, then $T_\alpha f \le_{cx} T_\alpha g$, $\alpha \in (0, 1)$.

Proof of Theorem 8: Let $g$ be the two-point pmf that
assigns probability $\{n\lambda\}$ to $\lfloor n\lambda \rfloor + 1$ and the remaining
probability to $\lfloor n\lambda \rfloor$. Note that the mean of $g$ is $n\lambda$. Also,
the relation $g \le_{cx} f^{*n}$ is intuitive and easily proven. Indeed,
if $\phi$ is a convex function on $\mathbb{Z}_+$, then

$$\phi(x) \ge (x - \lfloor n\lambda \rfloor)\, \phi(\lfloor n\lambda \rfloor + 1) + (\lfloor n\lambda \rfloor + 1 - x)\, \phi(\lfloor n\lambda \rfloor).$$

The claim follows by taking the weighted average with respect
to $f^{*n}$. By Proposition 9, $T_{1/n} g \le_{cx} T_{1/n}(f^{*n})$. Since $f$ is
ULC, so is $T_{1/n}(f^{*n})$. By Proposition 8, $D(T_{1/n}(f^{*n})) \le D(T_{1/n} g)$.
However $T_{1/n} g$ is a mixture of two binomials:

$$T_{1/n} g = \{n\lambda\}\, \mathrm{bi}(\lfloor n\lambda \rfloor + 1, 1/n) + (1 - \{n\lambda\})\, \mathrm{bi}(\lfloor n\lambda \rfloor, 1/n).$$

Thus (25) holds by the convexity of the relative entropy. ■
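
The bound (25) can be evaluated in closed form when $f$ is binomial. The sketch below takes $f = \mathrm{bi}(m, q)$, which is ULC, and uses two standard facts as assumptions: $f^{*n} = \mathrm{bi}(nm, q)$, and thinning $\mathrm{bi}(N, q)$ by $\alpha$ gives $\mathrm{bi}(N, \alpha q)$. It then compares both sides of (25) with $\lambda = mq$. The function names and the parameters $m = 3$, $q = 0.5$ are illustrative choices, not taken from the paper.

```python
import math

def rel_entropy_binom_vs_poisson(N, p, lam):
    """D(bi(N, p) | po(lam)) in nats; requires 0 < p < 1 and lam > 0."""
    total = 0.0
    for i in range(N + 1):
        log_b = (math.lgamma(N + 1) - math.lgamma(i + 1) - math.lgamma(N - i + 1)
                 + i * math.log(p) + (N - i) * math.log1p(-p))
        log_po = i * math.log(lam) - lam - math.lgamma(i + 1)
        total += math.exp(log_b) * (log_b - log_po)
    return total

def theorem8_sides(m, q, n):
    """Return (lhs, rhs) of (25) for f = bi(m, q); requires n >= 2."""
    lam = m * q
    # T_{1/n}(f^{*n}) = bi(n*m, q/n) for binomial f (assumed standard fact).
    lhs = rel_entropy_binom_vs_poisson(n * m, q / n, lam)
    floor_part = math.floor(n * lam)
    frac = n * lam - floor_part
    rhs = (frac * rel_entropy_binom_vs_poisson(floor_part + 1, 1.0 / n, lam)
           + (1.0 - frac) * rel_entropy_binom_vs_poisson(floor_part, 1.0 / n, lam))
    return lhs, rhs

for n in (2, 3, 5, 10, 20):
    lhs, rhs = theorem8_sides(3, 0.5, n)
    print(n, lhs, rhs, lhs <= rhs)
```

The restriction $n \ge 2$ only keeps the thinning probability $1/n$ strictly below one, so that the log-pmf formula is well defined.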
Although (25) implies the right order of the convergence
rate, the bound itself does not involve the variance of $f$. It is
known that, if $f$ is ULC, then its variance $\sigma^2$ does not exceed
its mean $\lambda$ ([19], [36]). It is intuitively reasonable that the
closer $\sigma^2$ is to $\lambda$, the smaller $D(f)$ and $D(T_{1/n}(f^{*n}))$ are.
Hence any bound that accounts for the variance $\sigma^2$ would be
interesting.

Of course, it would also be interesting to see the ULC
assumption relaxed. Theorem 9 shows that the $O(n^{-2})$ rate
holds under a finite support assumption. Note that, in the CLT
case, an $O(n^{-1})$ rate of convergence for the relative entropy
can be obtained under a "spectral gap" assumption ([1], [20]);
possibly a similar assumption suffices in our case. Under the
finite support assumption, however, the proof of Theorem 9
is elementary, although it does use a nontrivial subadditivity
property of the scaled Fisher information ([23], [27]).

Theorem 9: Suppose $f$ is a pmf on $\mathbb{Z}_+$ with finite support
and denote the mean and variance of $f$ by $\lambda$ and $\sigma^2$ respectively.
Then

$$D(T_{1/n}(f^{*n})) = O(n^{-2}), \quad n \to \infty. \tag{26}$$

If $\lambda = \sigma^2$ in addition, then the right hand side of (26) can be
replaced by $O(n^{-3})$.

Proof: Let us assume $\lambda > 0$ to eliminate the trivial
case. For a pmf $g$ on $\mathbb{Z}_+$ with mean $\mu > 0$, define
$K(g) = \mu\, \chi^2(S(g), g)$ as in O-. Madiman et al. ([27],
Theorem III) show that $K(g^{*n})$ decreases in $n$. In particular,
letting $g = T_{1/n}(f)$, and noting (12) and (2), we obtain

$$D(T_{1/n}(f^{*n})) \le K(T_{1/n}(f^{*n})) \le K(T_{1/n}(f)).$$

Thus, to prove (26), we only need $K(T_{1/n}(f)) = O(n^{-2})$.
By the definition of $K(\cdot)$ and ©, this is equivalent to

$$\chi^2(T_{1/n}(S(f)), T_{1/n}(f)) = O(n^{-1}). \tag{27}$$

However, for each $i \ge 0$ we have

$$(T_{1/n}(f))_i = \sum_{j=i}^{k} f_j\, \mathrm{bi}(i; j, 1/n),$$

where $k$ is the largest integer such that $f_k \ne 0$; a similar
expression holds for $T_{1/n}(S(f))$. By direct calculation, each
term in the sum

$$\sum_{i=0}^{k} \frac{\left((T_{1/n}(S(f)))_i - (T_{1/n}(f))_i\right)^2}{(T_{1/n}(f))_i} \tag{28}$$

is $O(n^{-1})$, and (27) holds. If $\lambda = \sigma^2$, then each term in (28)
is $O(n^{-2})$, thus proving the remaining claim. ■
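
The two rates in Theorem 9 can be observed numerically. The sketch below is a minimal illustration with hypothetical helper names and two example pmfs chosen here: `generic` has $\sigma^2 \ne \lambda$, while `matched` has $\lambda = \sigma^2 = 0.5$ and is not ULC. It computes $T_{1/n}(f^{*n})$ as the $n$-fold convolution of $T_{1/n}(f)$, relying on the fact (assumed here as a standard property) that thinning commutes with convolution.

```python
import math

def convolve(a, b):
    """Convolution of two pmfs given as lists indexed by 0, 1, 2, ..."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def thin(f, alpha):
    """Binomial thinning: (T_alpha f)_i = sum_j f_j * bi(i; j, alpha)."""
    out = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            out[i] += fj * math.comb(j, i) * alpha**i * (1.0 - alpha)**(j - i)
    return out

def rel_entropy_to_poisson(g, lam):
    """D(g | po(lam)) in nats for a finitely supported pmf g."""
    total = 0.0
    for i, gi in enumerate(g):
        if gi > 0.0:
            log_po = i * math.log(lam) - lam - math.lgamma(i + 1)
            total += gi * (math.log(gi) - log_po)
    return total

def d_thinned_sum(f, n):
    """D(T_{1/n}(f^{*n}) | po(lam)), with lam the mean of f."""
    lam = sum(i * fi for i, fi in enumerate(f))
    g = thin(f, 1.0 / n)       # thin first, then convolve n times
    power = [1.0]              # point mass at 0, the identity for convolution
    for _ in range(n):
        power = convolve(power, g)
    return rel_entropy_to_poisson(power, lam)

generic = [0.5, 0.0, 0.0, 0.5]     # mean 1.5, variance 2.25
matched = [0.625, 0.25, 0.125]     # mean 0.5, variance 0.5
for n in (1, 2, 4, 8, 16, 32):
    dg, dm = d_thinned_sum(generic, n), d_thinned_sum(matched, n)
    # n^2 * D should stay bounded in general; n^3 * D when mean == variance.
    print(n, n**2 * dg, n**3 * dm)
```

The printed columns are the quantities that Theorem 9 asserts to be bounded; the sketch does not attempt to identify their limits.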
Theorems 8 and 9 imply a corresponding rate of convergence
for the total variation distance, which is defined as
$V(g, \tilde{g}) = \sum_i |g_i - \tilde{g}_i|$ for any pmfs $g$ and $\tilde{g}$. The total variation
is related to the relative entropy via Pinsker's inequality
$V^2(g, \tilde{g}) \le 2D(g \,|\, \tilde{g})$. Hence, if $f$ is either ULC or has finite
support, then

$$V(T_{1/n}(f^{*n}), \mathrm{po}(\lambda)) = O(n^{-1}).$$
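
The step from the relative entropy rate to the total variation rate can be spelled out as follows. This is a short worked derivation using only Pinsker's inequality and the $O(n^{-2})$ rates of Theorems 8 and 9; the constant $C$ is an unspecified bound introduced here for illustration, valid for all sufficiently large $n$.

```latex
V^2\big(T_{1/n}(f^{*n}), \mathrm{po}(\lambda)\big)
  \le 2\, D\big(T_{1/n}(f^{*n}) \,|\, \mathrm{po}(\lambda)\big)
  \le \frac{2C}{n^2}
\quad\Longrightarrow\quad
V\big(T_{1/n}(f^{*n}), \mathrm{po}(\lambda)\big) \le \frac{\sqrt{2C}}{n}.
```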

An explicit upper bound, possibly via the Stein-Chen method, 
is of course desirable. 

VIII. Summary and possible extensions 

We have extended the monotonicity of entropy in the central 
limit theorem to a version of the law of small numbers, 
which involves the thinning operation (the discrete analogue 
of scaling), and a Poisson limit (the discrete counterpart of the 
normal). For a pmf $f$ on $\mathbb{Z}_+$ with mean $\lambda$, we show that the
relative entropy $D(T_{1/n}(f^{*n}) \,|\, \mathrm{po}(\lambda))$ decreases monotonically
in $n$ (Theorem 2), and, if $f$ is ultra-log-concave, the entropy
$H(T_{1/n}(f^{*n}))$ increases in $n$ (Theorem 3). In the process of
establishing Theorem 2, inequalities are obtained for the relative
entropy under thinning and convolution, and connections
are made with logarithmic Sobolev inequalities and with the
recent results of Kontoyiannis et al. [23] and Madiman et al.
[27]. Theorem 3, in contrast, is established by comparing pmfs
with respect to the convex order, an idea that dates back to
Karlin and Rinott [22].

This work is arguably more qualitative than quantitative, 
given its focus on monotonicity. When bounds are occasionally 
obtained, in Proposition 2 for example, we do not claim that
they are always sharp. Among the large literature on Poisson
approximation bounds (e.g., Barbour et al. [3]), the use of
information theoretic ideas is a relatively new development
([23], [27]). We have, however, obtained an upper bound and
identified an $O(n^{-2})$ rate for the relative entropy under certain
simple conditions. Such results complement those of [15] and 

m. 

The analogy with the CLT leads to further questions. For example,
given the intimate connection between the information-theoretic
CLT and Shannon's entropy power inequality (EPI),
it is natural to ask whether there exists a discrete version of
the EPI. By analogy with the CLT, our results seem to suggest
that the answer is yes, although there is still much to be done.
Certain simple formulations of the EPI do not hold in the
discrete setting; see [41] for recent developments.

We may also consider extending our monotonicity results 
to compound Poisson limit theorems. Recently, Johnson et al.
[21] (see also [39]) have shown that compound Poisson distributions
admit a maximum entropy characterization similar
to that of the Poisson. Such results suggest the possibility of 
compound Poisson limit theorems with the same appealing 
"entropy increasing to the maximum" interpretation. 

Finally, on a more technical note, we point out a possible 
refinement of Theorem 2. This is analogous to the results
of Yu [40], who noted that relative entropy is completely
monotonic in the CLT for certain distribution families. (A
function is completely monotonic if its derivatives of all orders
exist and alternate in sign; the definition is similar for discrete
sequences; see Feller [12] for the precise statements.)

Theorem 10 ([40]): Let $X_i$, $i = 1, 2, \ldots$, be i.i.d. random
variables with distribution $F$, mean $\mu$, and variance
$\sigma^2 \in (0, \infty)$. Then $D\big(\sum_{i=1}^{n} (X_i - \mu)/\sqrt{n\sigma^2} \,\big|\, N(0, 1)\big)$ is a
completely monotonic function of $n$ if $F$ is either a gamma
distribution or an inverse Gaussian distribution.

Part of the reason that the gamma and inverse Gaus- 
sian distributions are considered is that they are analytically 
tractable. The result may conceivably hold for a wide class of 
distributions. We conclude with a discrete analogue based on 
numerical evidence. 

Conjecture 1: Let $\lambda > 0$. Then

• $D(\mathrm{bi}(n, \lambda/n))$ is completely monotonic in $n$ ($n > \lambda$);

• $D(\mathrm{nb}(n, n/(\lambda + n)))$ is completely monotonic in $n$ ($n > 0$).

We again expect similar results for other pmfs, but are 
unable to prove even those for the binomial and the negative 
binomial. 
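
The kind of numerical evidence behind the binomial part of Conjecture 1 can be sketched as follows. For a sequence $a_n$, complete monotonicity is taken here to mean $(-1)^k \Delta^k a_n \ge 0$ for all $k$, with $\Delta a_n = a_{n+1} - a_n$; the code only checks the first few orders over a short integer range, so it is an illustration rather than a proof. The function names, $\lambda = 1$, the range of $n$ and the maximal order are assumptions made here.

```python
import math

def d_binomial(n, lam):
    """D(bi(n, lam/n) | po(lam)) in nats; requires integer n > lam."""
    p = lam / n
    total = 0.0
    for i in range(n + 1):
        log_b = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                 + i * math.log(p) + (n - i) * math.log1p(-p))
        log_po = i * math.log(lam) - lam - math.lgamma(i + 1)
        total += math.exp(log_b) * (log_b - log_po)
    return total

def forward_differences(values, max_order):
    """Return [values, Delta values, Delta^2 values, ...] up to max_order."""
    table = [list(values)]
    for _ in range(max_order):
        prev = table[-1]
        table.append([b - a for a, b in zip(prev, prev[1:])])
    return table

def signs_alternate(values, max_order):
    """True if (-1)^k * Delta^k a_n >= 0 for every available n and k <= max_order."""
    table = forward_differences(values, max_order)
    return all((-1) ** k * d >= 0 for k, row in enumerate(table) for d in row)

lam = 1.0
seq = [d_binomial(n, lam) for n in range(2, 13)]
print(signs_alternate(seq, 3))
```

Higher-order differences become small quickly, so pushing `max_order` much further in double precision would require more careful arithmetic than this sketch uses.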

Appendix 
Proof of Proposition 6

Let us recall a well-known characterization of the convex
order (see [30], Theorem 3.A.1, for example).

Proposition 10: Let $X$ and $Y$ be random variables on $\mathbb{Z}_+$
such that $EX = EY < \infty$. Then $X \le_{cx} Y$ if and only if

$$E\max\{X - k, 0\} \le E\max\{Y - k, 0\}, \quad k \ge 0,$$

or, equivalently,

$$\sum_{i \ge k} \Pr(X \ge i) \le \sum_{i \ge k} \Pr(Y \ge i), \quad k \ge 0.$$

Proposition 11: Fix $p \in (0, 1)$, and let $Y_1$ and $Y_2$ be i.i.d.
random variables on $\mathbb{Z}_+$ with an ultra-log-concave pmf $f$. Let
$Z_1$, $Z_2$, $Z_1'$ and $Z_2'$ be independent conditional on $Y_1$ and $Y_2$
and satisfy

$$Z_1 | Y_1 \sim \mathrm{Bi}(Y_1, p + \delta), \quad Z_2 | Y_2 \sim \mathrm{Bi}(Y_2, p - \delta),$$
$$Z_1' | Y_1 \sim \mathrm{Bi}(Y_1, p + \delta'), \quad Z_2' | Y_2 \sim \mathrm{Bi}(Y_2, p - \delta').$$

If $\delta > \delta' > 0$, then $Z_1 + Z_2 \le_{cx} Z_1' + Z_2'$.

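Proof: We show that, for each $k \ge 0$, $\sum_{i \ge k} \Pr(Z_1 + Z_2 \ge i)$
is a decreasing function of $\delta$ as long as $0 < \delta < \min\{p, 1 - p\}$.
The claim then follows from Proposition 10
(the assumptions imply $E(Z_1 + Z_2) = E(Z_1' + Z_2') < \infty$). To
simplify the notation, in what follows the limits of summation,
if not spelled out, are from $-\infty$ to $\infty$; also $f_i = 0$ if $i < 0$.
Denoting $B(i; n, p) = \sum_{j \ge i} \mathrm{bi}(j; n, p)$, and letting
$h(\delta) = \sum_{i \ge k} \Pr(Z_1 + Z_2 \ge i)$, we have

$$\begin{aligned}
h(\delta) &= \sum_{i \ge k} \sum_j \Pr(Z_1 = j) \Pr(Z_2 \ge i - j) \\
&= \sum_j \Pr(Z_1 \ge j) \Pr(Z_2 \ge k - j) \\
&= \sum_j \Big[\sum_{s \ge 0} f_s B(j; s, p + \delta)\Big] \Big[\sum_{s \ge 0} f_s B(k - j; s, p - \delta)\Big] \\
&= \sum_{s, t \ge 0} f_s f_t\, v(s, t, \delta),
\end{aligned} \tag{29}$$

where

$$v(s, t, \delta) = \sum_j B(j; s, p + \delta)\, B(k - j; t, p - \delta).$$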

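Using the simple identity

$$\frac{\partial B(i; n, p)}{\partial p} = n\, \mathrm{bi}(i - 1; n - 1, p),$$

we get

$$\frac{\partial v(s, t, \delta)}{\partial \delta} = s\, u(s, t, \delta) - t\, u(t, s, -\delta),$$

where

$$\begin{aligned}
u(s, t, \delta) &= \sum_j \mathrm{bi}(j - 1; s - 1, p + \delta)\, B(k - j; t, p - \delta) \\
&= \sum_j \mathrm{bi}(k - j - 1; s - 1, p + \delta)\, B(j; t, p - \delta).
\end{aligned}$$

The quantity $u(s, t, \delta)$ has the following interpretation. If we
let $V_1 \sim \mathrm{Bi}(s - 1, p + \delta)$ and $V_2 \sim \mathrm{Bi}(t, p - \delta)$ independently,
then $u(s, t, \delta) = \Pr(V_1 + V_2 \ge k - 1)$. Clearly

$$u(s, t, \delta) = u(t + 1, s - 1, -\delta). \tag{30}$$

Hence, we may take the derivative under the summation in (29)
(by dominated convergence), and then apply (30) to obtain

$$\begin{aligned}
\frac{dh(\delta)}{d\delta} &= \sum_{s, t \ge 0} s f_s f_t\, u(s, t, \delta) - \sum_{s, t \ge 0} t f_s f_t\, u(t, s, -\delta) \\
&= \sum_{s \ge 1,\, t \ge 0} s f_s f_t\, u(s, t, \delta) - \sum_{s \ge 1,\, t \ge 0} (t + 1) f_{s-1} f_{t+1}\, u(t + 1, s - 1, -\delta) \\
&= \sum_{s \ge 1,\, t \ge 0} [s f_s f_t - (t + 1) f_{s-1} f_{t+1}]\, u(s, t, \delta).
\end{aligned} \tag{31}$$

By a change of variables $s \to t + 1$ and $t \to s - 1$ in (31),
and by (30), we get

$$\frac{dh(\delta)}{d\delta} = \sum_{s \ge 1,\, t \ge 0} [(t + 1) f_{t+1} f_{s-1} - s f_s f_t]\, u(s, t, -\delta). \tag{32}$$

Combining (31) and (32), and noting the symmetry, we obtain

$$\frac{dh(\delta)}{d\delta} = \sum_{1 \le s \le t} [s f_s f_t - (t + 1) f_{t+1} f_{s-1}]\, [u(s, t, \delta) - u(s, t, -\delta)]. \tag{33}$$

Because $f$ is ULC, if $s \le t$, then

$$s f_s f_t \ge (t + 1) f_{t+1} f_{s-1}. \tag{34}$$

We can also show ($s \le t$)

$$u(s, t, \delta) \le u(s, t, -\delta) \tag{35}$$

as follows. Let $W_1, W_2, W_3, W_4$ be independent random
variables such that

$$W_1 \sim \mathrm{Bi}(s - 1, p + \delta), \quad W_2 \sim \mathrm{Bi}(s - 1, p - \delta),$$
$$W_3 \sim \mathrm{Bi}(t - s + 1, p + \delta), \quad W_4 \sim \mathrm{Bi}(t - s + 1, p - \delta).$$

Then

$$u(s, t, \delta) = \Pr(W_1 + W_2 + W_4 \ge k - 1);$$
$$u(s, t, -\delta) = \Pr(W_1 + W_2 + W_3 \ge k - 1).$$

Since $\delta > 0$, we have $W_4 \le_{st} W_3$, which yields $W_1 + W_2 + W_4 \le_{st} W_1 + W_2 + W_3$,
and $u(s, t, \delta) \le u(s, t, -\delta)$ by the
definition of $\le_{st}$. Now (33), (34) and (35) give

$$\frac{dh(\delta)}{d\delta} \le 0,$$

i.e., $h(\delta)$ decreases in $\delta$. ■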
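
The monotonicity of $h(\delta)$ can be checked numerically for a specific ULC pmf. The sketch below is an illustration under assumptions made here: it takes $f = \mathrm{bi}(4, 0.5)$ (which is ULC), $p = 0.5$, and a small grid of $\delta$ values, and uses the fact that $Z_1$ has pmf $T_{p+\delta} f$ and $Z_2$ has pmf $T_{p-\delta} f$, independently. The helper names and the tolerance are hypothetical.

```python
import math

def thin(f, alpha):
    """Binomial thinning: (T_alpha f)_i = sum_j f_j * bi(i; j, alpha)."""
    out = [0.0] * len(f)
    for j, fj in enumerate(f):
        for i in range(j + 1):
            out[i] += fj * math.comb(j, i) * alpha**i * (1.0 - alpha)**(j - i)
    return out

def convolve(a, b):
    """Convolution of two pmfs given as lists indexed by 0, 1, 2, ..."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def tail_sum(g, k):
    """sum_{i >= k} Pr(X >= i) = E max(X - k + 1, 0) for X with pmf g."""
    return sum(gi * (i - k + 1) for i, gi in enumerate(g) if i >= k)

def h(f, p, delta, k):
    """h(delta) from the proof, for Z1 ~ T_{p+delta} f and Z2 ~ T_{p-delta} f."""
    z_sum = convolve(thin(f, p + delta), thin(f, p - delta))
    return tail_sum(z_sum, k)

f = [math.comb(4, i) * 0.5**4 for i in range(5)]   # bi(4, 0.5), a ULC pmf
p = 0.5
deltas = [0.05, 0.15, 0.25, 0.35, 0.45]            # all below min(p, 1 - p)
for k in range(9):
    values = [h(f, p, d, k) for d in deltas]
    # Each row should be non-increasing in delta, up to rounding error.
    print(k, all(a >= b - 1e-12 for a, b in zip(values, values[1:])))
```

For $k = 0$ the value $h(\delta) = 1 + E(Z_1 + Z_2)$ does not depend on $\delta$, which is why the comparison allows a small tolerance.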
Proof of Proposition 6: Given the basic properties of
majorization, we only need to prove that $E\phi(\sum_{i=1}^{n} Z_i)$ is
Schur-concave as a function of $(p_1, p_2)$ holding $p_3, \ldots, p_n$
fixed. Define $\psi(z) = E\phi(z + \sum_{i=3}^{n} Z_i)$. Since $\phi$ is convex, so
is $\psi$. (We may assume that $\psi$ is finite as the general case can
be handled by a standard limiting argument.) Proposition 11,
however, shows precisely that $E\psi(Z_1 + Z_2) = E\phi(\sum_{i=1}^{n} Z_i)$
is Schur-concave in $(p_1, p_2)$. ■